In [2]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly
from datetime import datetime

Exploring Price Data¶

The purpose of this project is to predict the price of hiring an event space in London based on the characteristics of that space (e.g. location, area, facilities). To do this, we have scraped all London-based event spaces from the Tagvenue website.

We have scraped 2 separate datasets:

  • event space dataset -> one row per event space, recording the characteristics of that space e.g. location, area, facilities etc.
  • price dataset -> records the various hire prices for each space. It is relatively complicated, with multiple rows per event space. An example table of prices for a single space is shown below:

image

To predict the hire prices, we will need to choose what price to actually predict! If an event space has multiple different prices, what price do we use as a target variable?

Objective¶

  1. Understand the multiple prices associated with each space - why are there multiple rows of prices per space? How can we understand each price?
  2. Choose what prices we will use as a target variable and how we will model these prices (e.g. separate model per day of week or something like that).

We are primarily interested in modelling minimum spend and hire fee price types - as such we will focus on these price types when exploring the data.

Setup¶

In [3]:
# relax pandas display limits for exploring wide dataframes
pd.options.display.max_rows = 500
pd.options.display.max_columns = 0
plotly.offline.init_notebook_mode()

Importing Data¶

Below we import the cleaned prices data as dataframes. Metadata for these tables can be found in the repo Readme.

In [167]:
prices = pd.read_csv('../data/cleaned_data/tag_venue_space_prices_25-Aug-22.csv')
venues = pd.read_csv('../data/cleaned_data/tag_venue_space_data_25-Aug-22.csv')
In [168]:
prices.head(10)
Out[168]:
space_url venue_url venue_name space_name latitude longitude day_of_week time_period_desc time_period time_from time_to time_length total_price price_type rent_price_split min_spend_split per_person_split
0 https://www.tagvenue.com/rooms/london/19171/th... https://www.tagvenue.com/venues/london/2034/th... the golden hinde entire ship 51.507014 -0.090456 Monday Per hour 6.0 - 0.0 6.0 0.0 18.0 360.0 hire fee per hour NaN NaN NaN
1 https://www.tagvenue.com/rooms/london/19171/th... https://www.tagvenue.com/venues/london/2034/th... the golden hinde entire ship 51.507014 -0.090456 Tuesday Per hour 6.0 - 0.0 6.0 0.0 18.0 360.0 hire fee per hour NaN NaN NaN
2 https://www.tagvenue.com/rooms/london/19171/th... https://www.tagvenue.com/venues/london/2034/th... the golden hinde entire ship 51.507014 -0.090456 Wednesday Per hour 6.0 - 0.0 6.0 0.0 18.0 360.0 hire fee per hour NaN NaN NaN
3 https://www.tagvenue.com/rooms/london/19171/th... https://www.tagvenue.com/venues/london/2034/th... the golden hinde entire ship 51.507014 -0.090456 Thursday Per hour 6.0 - 0.0 6.0 0.0 18.0 360.0 hire fee per hour NaN NaN NaN
4 https://www.tagvenue.com/rooms/london/19171/th... https://www.tagvenue.com/venues/london/2034/th... the golden hinde entire ship 51.507014 -0.090456 Friday Per hour 6.0 - 0.0 6.0 0.0 18.0 360.0 hire fee per hour NaN NaN NaN
5 https://www.tagvenue.com/rooms/london/19171/th... https://www.tagvenue.com/venues/london/2034/th... the golden hinde entire ship 51.507014 -0.090456 Saturday Per hour 6.0 - 0.0 6.0 0.0 18.0 360.0 hire fee per hour NaN NaN NaN
6 https://www.tagvenue.com/rooms/london/19171/th... https://www.tagvenue.com/venues/london/2034/th... the golden hinde entire ship 51.507014 -0.090456 Sunday Per hour 6.0 - 0.0 6.0 0.0 18.0 360.0 hire fee per hour NaN NaN NaN
7 https://www.tagvenue.com/rooms/london/21285/go... https://www.tagvenue.com/venues/london/9456/go... goodenough college events & venue hire large common room-wgh 51.524406 -0.117326 Monday Per day 9.0 - 17.0 9.0 17.0 8.0 1740.0 hire fee NaN NaN NaN
8 https://www.tagvenue.com/rooms/london/21285/go... https://www.tagvenue.com/venues/london/9456/go... goodenough college events & venue hire large common room-wgh 51.524406 -0.117326 Monday Per day 9.0 - 17.0 9.0 17.0 8.0 67.0 per person NaN NaN NaN
9 https://www.tagvenue.com/rooms/london/21285/go... https://www.tagvenue.com/venues/london/9456/go... goodenough college events & venue hire large common room-wgh 51.524406 -0.117326 Tuesday Per day 9.0 - 17.0 9.0 17.0 8.0 1740.0 hire fee NaN NaN NaN

Understanding Prices¶

Approach¶

The first thing we need to understand is why spaces have multiple rows of price data. The way to do this is relatively simple: we need to determine which combination of columns in the price data uniquely defines each price row.

That is a bit of a mouthful, so imagine a space with 30 rows of prices. We analyse the rows and notice that the combination of 2 columns, say day_of_week (Monday, Tuesday, Wednesday etc.) and price_type (hire fee, per person, minimum spend etc.), is always different between the rows. For example, the first row might represent the 'Monday - hire fee' price, and the next row the 'Tuesday - per person' price. We surmise that the combination of these 2 columns defines a separate category that each price belongs to, and this category defines what the price represents: a 'Monday - hire fee' combination means this is the price on a Monday when booking via a hire fee pricing method, and the 'Tuesday - per person' combination means this is the price on a Tuesday when booking via a per person pricing method.

Then we look at all spaces and price rows combined. We can separate the price rows into buckets for each category of prices, e.g. a bucket of 'Tuesday - per person' prices containing a single price per space to book that space on a Tuesday using the per person pricing method. Now we can understand and categorise the multiple prices per space into meaningful groups, where each group has just a single price per space and therefore a single target variable to predict.
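
A minimal sketch of this unique-key test on a hypothetical toy dataframe (not the real scraped data) looks like this; if a candidate set of columns uniquely defines each row, every group has size 1:

```python
import pandas as pd

# Hypothetical toy price data: one space with four price rows.
toy = pd.DataFrame({
    'space_url':   ['space_1'] * 4,
    'day_of_week': ['Monday', 'Monday', 'Tuesday', 'Tuesday'],
    'price_type':  ['hire fee', 'per person', 'hire fee', 'per person'],
    'total_price': [1740.0, 67.0, 1740.0, 67.0],
})

candidate_key = ['space_url', 'day_of_week', 'price_type']

# Every group has exactly one member -> the candidate key uniquely defines rows.
group_sizes = toy.groupby(candidate_key).size()
print(group_sizes.max())  # 1

# Equivalent one-liner: no row is a duplicate with respect to the candidate key.
print(toy.duplicated(subset=candidate_key).any())  # False
```

Dropping price_type from the candidate key would merge the 'hire fee' and 'per person' rows into groups of size 2, which is exactly the signal we look for below.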

We can make some intuitive guesses about the columns that define the prices. Firstly, we have the day_of_week variable (showing the weekday of that price, e.g. the price to hire the space on a Monday). We see many duplicated rows where only the day_of_week variable changes, usually with all 7 days of the week included. Thus, we can conclude that day_of_week probably categorises the prices, i.e. spaces quote separate prices for different days of the week.

Secondly, we have price_type. This tells us what type of pricing is being used e.g. 'minimum spend', 'per person', 'hire fee' etc. We shall guess that the price rows are uniquely defined by the combination of day_of_week and price_type, e.g. the 'Monday - hire fee' price, the 'Tuesday - per person' price etc.

Testing Hypothesis: day_of_week + price_type¶

We can test our guess for what columns define the price category. We do this by grouping all the rows of the prices data according to the combination of space_url (which acts as a unique id for each space), day_of_week and price_type. Then we count the members of each group. If each group has only a single member, then we only have a single price per combination of space_url, day_of_week and price_type which means our price categories are defined by day_of_week and price_type.

Below we perform this test and show a value count on the count per group. If we are correct, then the count of rows per group should always be 1 (each price uniquely defined by this combination of columns) and thus the value count will only have one row.

In [169]:
assumed_grouping = prices.groupby(['space_url', 'day_of_week', 'price_type'])
# count number of members in each group
count_per_group = assumed_grouping['total_price'].agg('count')

count_per_group.value_counts()
Out[169]:
1    28820
2     6899
3     2139
4      648
5       12
7        7
Name: total_price, dtype: int64

We anticipated that the value count would return a single row, i.e. that all groups would have size 1. Clearly that is not the case: with our grouping we are not always defining individual prices but sometimes groups of prices, up to 7 in a group.

The above test showed that there are more factors defining each price category in the dataset than day_of_week and price_type. Some exploration of the data revealed that time_period_desc, the column that verbally describes a time period (e.g. 'all day' or 'per hour'), is also used to define price categories. For example, a space may have multiple prices on a Monday that are 'hire fee' price types, differentiated by their time period descriptions: one may be the price 'per hour' and the other the price for 'all day'.

Below we quantify the fraction of the prices and spaces that require time_period_desc to define different price categories. Note that from the value count above, we know that 28820 prices do not require time_period_desc to uniquely categorise each row, and thus we can simply subtract this from the total number of prices to get the number of prices that are defined by time period description:

In [170]:
total_prices = prices.shape[0]
# subtract the first row of value count above from all price data count
num_prices_using_period = total_prices - 28820

# get access to multi index columns  
count_per_group = count_per_group.reset_index()
# gets space urls of all spaces that used time period desc to define prices
urls_using_time_period = count_per_group[
    count_per_group.total_price > 1
].space_url

num_spaces_using_period = urls_using_time_period.nunique()
total_spaces = prices.space_url.nunique()

print(f"{num_prices_using_period} prices out of {total_prices} total prices -> {num_prices_using_period / total_prices:.0%}")
print(f"{num_spaces_using_period} spaces out of {total_spaces} total spaces -> {num_spaces_using_period / total_spaces:.0%}")
22916 prices out of 51736 total prices -> 44%
1569 spaces out of 4761 total spaces -> 33%

We can see that one third of all spaces have used time period description to define their prices and 44% of our data is categorised by the time period description. We clearly need to include time period description in our exploration.

Testing Hypothesis: day_of_week + price_type + time_period_desc¶

To make sure that time period description is the only variable we were missing, we will repeat the above and group the data according to the 4 columns we have identified and see if these uniquely define each price:

In [171]:
assumed_grouping = prices.groupby(['space_url', 'day_of_week', 
                                   'time_period_desc', 'price_type'])
# count number of members in each group
count_per_group = assumed_grouping['total_price'].agg('count')

count_per_group.value_counts()
Out[171]:
1    45058
2     2613
3      414
4       42
6        7
Name: total_price, dtype: int64

Once again, we find that there is another variable that categorises prices. Some more exploration revealed that the actual time period itself, e.g. '12:00 - 14:00', can also define separate prices in the data, as shown below:

alt text

As before, we will quantify the number of spaces and prices that require time_period to be uniquely categorised:

In [172]:
# subtract the first row of value count above from all price data count
num_prices_using_period = total_prices - 45058

# get access to multi index columns  
count_per_group = count_per_group.reset_index()
# gets space urls of all spaces that used time period to define prices
urls_using_pricing_period = count_per_group[
    count_per_group.total_price > 1
].space_url

num_spaces_using_period = urls_using_pricing_period.nunique()

print(f"{num_prices_using_period} prices out of {total_prices} total prices -> {num_prices_using_period / total_prices:.0%}")
print(f"{num_spaces_using_period} spaces out of {total_spaces} total spaces -> {num_spaces_using_period / total_spaces:.0%}")
6678 prices out of 51736 total prices -> 13%
511 spaces out of 4761 total spaces -> 11%

We can see it's a much smaller amount of data that is priced in this way: only 11% of spaces and 13% of the price data. We will still need to investigate the time periods and see how they impact the prices and our regression.

Testing Hypothesis: day_of_week + price_type + time_period_desc + time_period¶

We will once again repeat the grouping analysis, this time using all 5 variables identified and see if they uniquely categorise all prices.

In [173]:
assumed_grouping = prices.groupby(['space_url', 'day_of_week', 'time_period_desc', 
                                   'price_type', 'time_period'])
# count number of members in each group
count_per_group = assumed_grouping['total_price'].agg('count')

count_per_group.value_counts()
Out[173]:
1    51736
Name: total_price, dtype: int64

Conclusions¶

The above shows that the 5 columns space_url, day_of_week, time_period_desc, price_type and time_period uniquely categorise every price in the prices data (every row).

This means that we can understand each row in the prices data as defining a price for the combination of day_of_week, time_period_desc, price_type and time_period. For example, look again at some price data below:

image

The total_price in the first row, £1740, is the hire fee price to book the space on a Monday on a per day basis between 9.00 - 17.00.

The next row down has a total_price of £67, which is the per person price to book the space on a Monday on a per day basis between 9.00 - 17.00.

The combination of these 4 columns (day_of_week, time_period_desc, price_type and time_period) categorises each price row in the prices data. Every different combination of these 4 columns defines a separate bucket of prices. We will assume for now that these buckets must be treated separately. For example, all rows in the example data above must be placed into separate models as totally separate prices. We can only predict hire fee + Monday + per day + 9.00 - 17.00 prices from other hire fee + Monday + per day + 9.00 - 17.00 prices.

Time Period Exploration¶

We will now explore the implications of prices being categorised by time_period. We chose to explore time period first because it's a little more complicated: it has a huge array of possible values (basically any time period you can think of), whereas the other columns that define prices have only a limited number of options (e.g. day_of_week has 7 values, Monday to Sunday). This creates some challenges for modelling. The most important is that having many different time_period values creates a huge number of price categories, each containing a small number of prices and thus a small number of observations to use in a model.

The first question we want to answer is: how many prices actually vary as a result of having a different time_period? This is subtly different from the categorisation analysis above. There we wanted to understand why a space had multiple price rows and what each row represented. Now we are asking how many of the prices actually change when the time period changes. Does the time period actually affect the price, or is the price largely independent of it?

We have noticed that spaces often have multiple rows of prices that are differentiated only by time_period. However, they often don't actually change the price itself. For example, in the image below, all of the 'Friday - per session - per person' prices have the same price (£20) for lots of different time periods. The time period is not actually changing or setting the price at all.

alt text

Below, we group all the prices by space_url, day_of_week, time_period_desc and price_type (excluding time_period). Then we count the number of unique prices in each group (so, for instance, the 'Friday - per session - per person' example above would form a group with 6 rows and 1 unique price, £20). We then print a value count of this count of unique prices per group. This indicates how many groups' prices vary with time period (count of unique prices greater than 1) versus how many do not (count of unique prices equal to 1).

In [174]:
# Group spaces by 'space_url', 'day_of_week', 'time_period_desc', 'price_type' 
# and count number of unique prices. If there is only 1 unique price, then 
# the time period is not varying the total price, if it is more than 1 then it is.
num_of_time_period_prices_per_group = (
    prices.groupby(['space_url', 'day_of_week', 
                    'time_period_desc', 'price_type'], 
                   as_index = False
                  )['total_price'].nunique()
)

num_of_time_period_prices_per_group.rename(columns = {'total_price': 'number_unique_prices'},
                                           inplace = True)
num_of_time_period_prices_per_group['number_unique_prices'].value_counts()
Out[174]:
1    45781
2     2001
3      326
4       26
Name: number_unique_prices, dtype: int64

Observations:

The vast majority of prices (45781 groups) do not actually vary by time period, and clearly only a small fraction of groups do (~2350). This means that for most prices, although they have been defined for certain time periods, the time period does not actually tend to change the price or be important for determining it. Below we quantify the number of spaces whose prices vary with time period.

In [175]:
# Find total spaces by counting number of unique space urls
total_spaces = num_of_time_period_prices_per_group.space_url.nunique()
# filter on groups with > 1 unique prices
time_period_groups = (
    num_of_time_period_prices_per_group[
        num_of_time_period_prices_per_group.number_unique_prices > 1
    ]
)
# Count unique space urls
total_time_period_spaces = time_period_groups.space_url.nunique()

print(f"{total_time_period_spaces} spaces out of {total_spaces} "
      + f"vary with time period -> {total_time_period_spaces/total_spaces: .2%}")
405 spaces out of 4761 vary with time period ->  8.51%

Conclusions:

405 spaces have prices that vary with time period, accounting for 8.5% of all our spaces. This is a small fraction of our data, so we are quite happy to simply drop all price rows which vary according to time period; we think this will only modestly reduce the number of observations we can use to model.

We have decided to drop time period varying rows of data because trying to include time period in our models introduces some unwanted complexity:

  • Most prices do not vary by time period, so how do you combine the varying prices with the non-varying prices? If kept in the same model, the large number of duplicated prices (e.g. the £20 example above) may lead to overfitting / overconfidence, and the time period trends in the varying prices will be drowned out by the non-varying prices. If you split into separate models, there is only a small sample of at most 405 observations to train the varying data on.
  • We could keep some of the time period varying prices by aggregating over them in some way, e.g. taking the average, maximum or minimum price. This would allow us to keep more price data for modelling. It would, however, mean manually tampering with our training data, and it's unclear what impact this would have on the validity of our model.
  • We suspect that venues pay little attention to time period when putting their data on Tagvenue. As such, we suspect it will be an unreliable predictor of price.

Finally, it is worth noting that dropping all groups whose data varies with time period will not drop 405 spaces from our dataset. It will only drop rows of data whose price varies due to the time period - so if a space has some rows which do not vary with time period, that space will not be dropped from the data (it will however be dropped from certain price categories).
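
To illustrate that point with made-up data (using groupby().filter as a compact stand-in for the merge-based approach in the next section): a space whose Monday prices vary with time period but whose Tuesday prices do not will lose only its Monday rows:

```python
import pandas as pd

# Hypothetical space: Monday prices vary with time period, Tuesday prices do not.
toy = pd.DataFrame({
    'space_url':   ['space_1'] * 4,
    'day_of_week': ['Monday', 'Monday', 'Tuesday', 'Tuesday'],
    'price_type':  ['min. spend'] * 4,
    'time_period': ['9 - 12', '18 - 23', '9 - 12', '18 - 23'],
    'total_price': [500.0, 900.0, 400.0, 400.0],
})

# Keep only groups with a single unique price,
# i.e. where the price does not vary with time period.
kept = toy.groupby(['space_url', 'day_of_week', 'price_type']).filter(
    lambda g: g['total_price'].nunique() == 1
)

print(kept.day_of_week.unique())  # only the Tuesday rows survive
print(kept.space_url.nunique())   # 1 -> the space itself is not dropped
```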

Remove prices that vary with time_period¶

Now we will drop all groups (grouped by space_url, day_of_week, time_period_desc and price_type) from the prices data where the total_price varies according to time_period. The easiest way to map our groups, and their associated counts of unique prices, back onto the price rows is to merge the prices df with the num_of_time_period_prices_per_group df and then use the 'number_unique_prices' column to drop the relevant rows.

In [176]:
# merge dfs based on combinations of 'space_url', 'day_of_week', 
# 'time_period_desc' and 'price_type'. 
prices_cleaned = pd.merge(prices, num_of_time_period_prices_per_group, 
                          on = ['space_url', 'day_of_week', 
                                'time_period_desc', 'price_type'], 
                          how = 'left')

prices_cleaned.number_unique_prices.value_counts()
Out[176]:
1    46545
2     4094
3      993
4      104
Name: number_unique_prices, dtype: int64

Now we drop all rows where the price is dependent upon the time_period i.e. where the count of unique prices value is more than 1.

In [177]:
# filter out prices whose number_unique_prices is > 1
prices_cleaned = prices_cleaned[prices_cleaned.number_unique_prices == 1]
prices_cleaned = prices_cleaned.drop(columns = ['number_unique_prices'])

Finally, we remove duplicates on space_url, day_of_week, time_period_desc and price_type to remove all rows which share the same values of these columns and the same price (i.e. rows that were categorised by time_period but didn't actually vary in price because of it). Once this is done, every single row remaining in the prices data will map to a separate price category defined by the combination of day_of_week, time_period_desc and price_type.

In [178]:
prices_cleaned = prices_cleaned.drop_duplicates(['space_url', 'day_of_week', 
                                                 'time_period_desc', 'price_type'])
prices_cleaned = prices_cleaned.reset_index(drop = True)

To illustrate more clearly what we have done, take the old price data example below:

alt text

We had 6 rows with 'Friday', 'Per session' and 'per person' in their 'day_of_week', 'time_period_desc' and 'price_type' columns respectively. They all had a total cost of £20 and were separated only by their time_period, which varied for each row. We have removed time_period as a category for our prices, which collapses the above example into a single price categorised as a 'Friday - Per session - per person' price with a total cost of £20.
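
The collapse can be sketched on a made-up version of that example (six £20 rows differing only in time_period; values are illustrative):

```python
import pandas as pd

# Six hypothetical 'Friday - Per session - per person' rows, all £20,
# differing only in their time_period.
toy = pd.DataFrame({
    'space_url':        ['space_1'] * 6,
    'day_of_week':      ['Friday'] * 6,
    'time_period_desc': ['Per session'] * 6,
    'price_type':       ['per person'] * 6,
    'time_period':      ['9 - 11', '11 - 13', '13 - 15',
                         '15 - 17', '17 - 19', '19 - 21'],
    'total_price':      [20.0] * 6,
})

# Dropping duplicates on the category-defining columns keeps a single £20 row.
collapsed = toy.drop_duplicates(['space_url', 'day_of_week',
                                 'time_period_desc', 'price_type'])

print(len(collapsed))                 # 1
print(collapsed.total_price.iloc[0])  # 20.0
```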

Below we briefly quantify the amount of data that we lost:

In [179]:
rows_remaining = prices_cleaned.shape[0]
original_rows = prices.shape[0]
total_spaces_original = total_spaces
total_spaces = prices_cleaned.space_url.nunique()

print(f"We dropped {original_rows - rows_remaining} rows of price data, "
      + f"from {original_rows} to {rows_remaining} rows of data")
print(f"We lost {total_spaces_original - total_spaces} spaces, from {total_spaces_original} to {total_spaces} total spaces")
We dropped 5955 rows of price data, from 51736 to 45781 rows of data
We lost 168 spaces, from 4761 to 4593 total spaces

As you can see, we lost only a small amount of data and 168 spaces.

Visualise Pricing Categories¶

We now have prices that are categorised by the combination of day_of_week, time_period_desc and price_type. Each unique combination of these 3 columns defines a separate bucket of prices and our initial assumption is that they must be treated separately i.e. we can only predict 'Monday - per day - hire fee' prices from other 'Monday - per day - hire fee' prices.

To understand how practical it would be to create separate models per pricing category, we need to understand how many prices are present in each category, i.e. how many observations we would have to build each model. We are hoping for at least ~1000 observations per model to get good predictive power.

Below we visualise the count of prices for each price category (all combinations of day_of_week, time_period_desc and price_type). Note that each bar in the chart represents the count of prices for a different pricing category.

In [180]:
fig = px.histogram(prices_cleaned, x="price_type", facet_row = 'day_of_week',
                   color = 'time_period_desc', barmode='group', height = 1300,
                  title = 'Count of Prices per Price Category')

fig.show()
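
As a numeric cross-check of the chart, the count of prices per category is just the group size over the three category-defining columns; a sketch on a toy frame (the real call would replace toy with prices_cleaned):

```python
import pandas as pd

# Toy stand-in for prices_cleaned with three price categories.
toy = pd.DataFrame({
    'day_of_week':      ['Monday', 'Monday', 'Monday', 'Tuesday'],
    'time_period_desc': ['per day', 'per day', 'Per evening', 'per day'],
    'price_type':       ['hire fee', 'hire fee', 'min. spend', 'hire fee'],
})

# One row per (day_of_week, time_period_desc, price_type) category,
# giving the number of prices (observations) available in that category.
category_counts = (
    toy.groupby(['day_of_week', 'time_period_desc', 'price_type'])
       .size()
       .sort_values(ascending=False)
)
print(category_counts)
```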

Observations: Note that we will focus on the categories using the 'hire fee' and 'min. spend' price_types, as these are the primary focus of this project.

'Hire Fee' and 'min. spend' observations:

  • The categories with price_type = 'hire fee' and time_period_desc = 'per day' always have around 1500 observations (except at weekends, where they have just under 1000). These categories have plenty of observations to build a decent model from, and we would be most interested in the 'per day' cost of venue hire.
  • The categories with price_type = 'hire fee' where time_period_desc is not 'per day' have far fewer observations, at most ~400. We are not particularly interested in these less popular time periods and would be happy to model just the 'per day' prices.
  • The categories with price_type = 'min. spend' have a maximum of ~750 observations. The most popular time_period_descs are 'Per evening' and 'per session', which generally have around 700 observations each (per session always having more). We would ideally want a few more observations per model than ~700. The 'per session' value for time_period_desc is likely problematic because it is not obviously mutually exclusive from the other time_period_desc options: 'per day', 'per evening', 'per morning' and 'per afternoon' are all clearly mutually exclusive, but any one of them could be replaced with the vague 'per session'. We suspect that when venue owners uploaded spaces to Tagvenue, they used 'per session' interchangeably with the other time_period_descs. Thus, we suspect time_period_desc may be a poor predictor of price, and perhaps we could remove it and merge the observations into a single set larger than 1000 that is not categorised by time_period_desc.

Other Observations

  • The price_type = 'hire fee per hour' only has a single time_period_desc which is 'per hour', which makes sense. The 'per hour' time_period_desc is not used by other price_types.
  • The categories using a combined price_type (e.g. 'hire fee + min. spend') have very few observations and can be ignored.
  • Across all price_types we see a fairly constant number of observations on weekdays and a drop at weekends. Perhaps venues tend to offer their space for hire in the week but prefer walk-ins at weekends, where they probably see more customers.
  • The typical trends for different combinations of price_type and time_period_desc are repeated across the different weekdays, suggesting most places offer prices across all 7 days of the week.

Explore Time_Period_Desc¶

Motivations¶

Our plot above suggests we need some more observations / prices for our 'min. spend' models. One way to achieve this would be to remove time_period_desc from the definition of the price categories (i.e. collapse all 'min. spend' price categories per weekday into a single category).

This would introduce complications when spaces have multiple prices set by the time_period_desc i.e. if a space has a £30 price for min. spend for a time_period_desc of 'per session' on a Monday and a £40 price for a min. spend for a time_period_desc of 'per evening' on a Monday. In this situation, how would we select a price to use to train our model?

There are several possible solutions:

  • We could pool together the different time_period_desc buckets of prices with the same price_type and day_of_week variable, e.g. pool all prices that are 'hire fee - Monday' into a single category of prices, and then use time_period_desc as a categorical variable. This way, our model could discern between 2 prices at the same venue with the same price_type and day_of_week but separated by time_period_desc, e.g. a venue with 2 'hire fee - Monday' prices differentiated by different time_period_descs such as 'per day' and 'per morning'. The benefit of this approach is that it would preserve all observations; we would lose no data for modelling. The drawback depends on how good time_period_desc is as a predictor of price. We suspect that prices are not strongly determined by time_period_desc and therefore it could be a poor predictor, yet we would need to use it as a predictor, otherwise we would end up with no variable distinguishing between some prices.
  • We could use some aggregation mechanism, e.g. take the average price or lowest price. This would cause issues because it's unclear how tampering with our training data and target variable like this would impact the validity of the model.
  • We could simply drop all rows of data where this overlap occurs. The negative impact of this would depend on how much data we end up losing or gaining (ideally we want to reach over 1000 observations per model by collapsing over time_period_desc; hopefully this isn't undermined by dropping overlapping rows).
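
To make the first option concrete, here is a minimal hedged sketch (toy data, with a made-up 'tpd' column prefix) of encoding time_period_desc as a categorical feature with pd.get_dummies:

```python
import pandas as pd

# Hypothetical pooled 'hire fee - Monday' prices: space_1 has two prices
# separated only by their time_period_desc.
toy = pd.DataFrame({
    'space_url':        ['space_1', 'space_1', 'space_2'],
    'time_period_desc': ['per day', 'per morning', 'per day'],
    'total_price':      [1740.0, 600.0, 900.0],
})

# One-hot encode time_period_desc so a model could distinguish the two
# space_1 prices while keeping all observations in a single pooled dataset.
features = pd.get_dummies(toy[['time_period_desc']], prefix='tpd')
print(list(features.columns))  # ['tpd_per day', 'tpd_per morning']
```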

Our preference would be to drop the problematic data. We will need to quantify the number of spaces / prices whose prices vary across multiple time_period_descs (for the same space, price_type and day_of_week).

Visualise time_period_desc price overlaps¶

Below we group our prices by space_url, day_of_week and price_type and then count the number of unique prices per group. We then perform a value_counts on this count to show the relative number of groups that are not impacted by this issue (count of unique prices = 1) to the number of prices that are impacted by this (count of unique prices > 1):

In [181]:
num_prices_if_collapsed = prices_cleaned.groupby(['space_url', 'day_of_week', 
                         'price_type'], as_index = False)['total_price'].nunique()

num_prices_if_collapsed.rename(columns = {'total_price':'num_prices'}, inplace = True)

num_prices_if_collapsed.num_prices.value_counts()
Out[181]:
1    30626
2     4758
3      857
4       25
Name: num_prices, dtype: int64

We can see once again that the majority of groups (30626) would not be impacted by this issue, compared with ~5600 which would be. This is a good sign.

We will now plot the count of num_prices per group when the groups are split into the new price categories we are proposing (combinations of day_of_week and price_type). This is a little confusing, so to clarify: the plot below splits the num_prices_if_collapsed data into the new price-category buckets, defined only by the day_of_week and price_type columns. Within each category, we then plot the count of each num_prices value (from 1 to 4). The count for num_prices = 1 is the number of spaces in that category that would keep a single unambiguous price once time_period_desc is removed from the category definition, while the counts for num_prices > 1 are the spaces that would be left with multiple conflicting prices. The count of num_prices = 1 therefore tells us how many observations a model for that price category would have available for training (assuming we simply drop the spaces with multiple prices).

In [182]:
fig2 = px.histogram(num_prices_if_collapsed, x="price_type", 
                    facet_row = 'day_of_week', color = 'num_prices',
                    barmode='group', height = 1300, 
                   title = 'Count of different num_prices values per new price category')

fig2.show()

Hire Fee and Min. Spend Observations:

  • For price_type = 'min. spend', the removal of time_period_desc is very successful, it has given us a total of ~1200 observations per weekday which is a good amount for our model. We are losing about 300 groups of price data per weekday but we think this is a good tradeoff.
  • For price_type = 'hire fee', the count of num_prices = 1 is basically identical to the number of hire fee observations in the previous chart where time_period_desc is 'per day': around 1,500 on weekdays and about 1,000 on weekends. This is a little surprising. It suggests that removing time_period_desc from the price category definition has not increased the number of available observations for modelling this group! Presumably, spaces that use the 'per day' time_period_desc tend to have only a single price, whereas spaces using other time_period_descs (e.g. 'per session') tend to have multiple different prices and thus do not fall into the num_prices = 1 bucket.

Other Observations:

  • For price_type = 'per person', we also see an improvement in the number of observations available.
  • For price_type = 'hire fee per hour' we see no change since it only ever had a single 'time_period_desc' and hence is unaffected by this change (it was always effectively independent of time_period_desc).

Visualise Predictive Value of time_period_desc¶

We are exploring removing time_period_desc from the definition of price categories. We want to assess the predictive power of the time_period_desc. Do prices categorised into different time_period_desc's have very different price distributions? This may suggest they behave quite differently and so should be modeled separately. It also will give us an idea of the predictive power of using time_period_desc as a categorical variable - does it clearly sort the prices into different price ranges?

We will answer the above by splitting the prices according to price category (defined by day_of_week, price_type and time_period_desc) and then plot a box plot of each price category.

Analysis Note - The box plots have quite different ranges, so the default views are very small and hard to read. We used the interactive functionality of plotly to individually zoom in on the 'min. spend' and 'hire fee' price types (zooming on one plot zooms the rest, making this very easy).

In [183]:
fig1 = px.box(prices_cleaned, x="price_type", y = 'total_price', 
              facet_row = 'day_of_week', color = 'time_period_desc',
              height = 1300, range_y = (0, 20000),
              title = 'Distribution of total_cost per Price Category')
fig1.show()

Min. Spend Observations:

A screenshot of a typical zoom in to price_type = 'min. spend' is shown below: image

  • This distribution is typically very similar for all days of the week.
  • The 'per day' (red) and 'per evening' (green) prices have very similar widths and distributions. We find 'per evening' always has a lower median than 'per day', which makes sense since a day is a longer hire period and so would be expected to be more expensive. 'Per day' is, however, typically not much more expensive, probably because evenings are in higher demand.
  • 'per session' (purple) typically has a smaller spread and lower median than both 'per day' and 'per evening'. As we suggested before, we think spaces used 'per session' interchangeably with 'per day', 'per evening', 'per morning' and 'per afternoon'. As such, the median may have been brought down by morning and afternoon time periods being included (since they always have the lowest medians), and the spread may have been reduced for the same reason, i.e. morning and afternoon have smaller spreads. 'Per session' can be seen as a merging of the other time_period_descs.
  • The inherent unreliability of the 'per session' time_period_desc will cause problems for using the time_period_desc as a categorical variable since it appears to be a messy combination of the other time periods.
  • Overall, the distribution of price for different time period descriptions are not significantly different. There is a general downward trend i.e. typically lower prices and less variation from left to right (per day to per morning). It doesn't look like time_period_desc is a particularly powerful predictor.

Hire Fee Observations:

A screenshot of a typical zoom in to price_type = 'hire fee' is shown below: image

  • The trend shown above is repeated across all weekdays; on weekends, we see increased distribution widths and medians.
  • The 'per day' plot (red) has a noticeably thinner distribution width than the 'per evening' (green) plot and the 'per session' plot (purple).
  • On weekends, the 'per day' plot (red) looks almost identical to the 'per session' plot. We can understand the 'per session' plot as a mixture of all the other plots (due to the 'per session' label being vague and interchangeable). On weekdays it looks like an average of the 'per day' (red) and 'per evening' (green) distributions, which have either the most observations ('per day') or highest typical values ('per evening') and thus would be expected to dominate influencing the average.
  • Unlike for min. spend, we see that 'per evening' (green) always has a larger median and wider distribution than 'per day' (red). This is the opposite of 'min. spend', implying that, for hire fees, an evening's hire is worth more than a full day's.
  • The 'per morning' (orange) and 'per afternoon' (blue) plots have very similar distributions - perhaps they could be combined into a single morning / afternoon group within the time_period_desc variable.
  • The distributions are a bit more varied between one another than for 'min spend'. This could suggest we would be better to model different time_period_descs separately.

Re-Classify 'Per Session' Time Period Description¶

Motivation¶

Our analysis above identified an issue with the time_period_desc 'Per session'. We believe 'Per session' could have been used interchangeably with the remaining mutually exclusive time_period_descs 'Per day', 'Per morning', 'Per afternoon' and 'Per evening'. For example, a space may have used the time_period_desc 'per morning' or 'per session' for the same price. Thus, we suspect that 'per session' has become a catch-all for the other time_period_descs and is therefore fairly meaningless.

In this section, we will explore re-classifying the 'per session' prices into 'Per day', 'Per Morning', 'Per Afternoon' and 'Per Evening'. This could be another solution to increasing the number of observations available for a 'min. spend' model without having to merge all time_period_desc prices together.

Visualise Time_period_descs¶

We will use the 'time_from' and 'time_to' columns to reclassify the 'per session' time_period_desc. The 'time_from' column records what time the space will be hired from (it is in 24 hour time and stored as an int so 0 is midnight, 9 is 9.00 am, 14 is 2.00 pm etc.) and the 'time_to' column records when the hire period ends. E.g. a time_from of 6 and time_to of 12 means the spaces will be hired from 6.00 am until 12.00 pm. Thus, these 2 columns specifically define the actual time period the space is being hired for at this price. The time_period_desc column provides a written description of this time period e.g the previous example would probably be a 'Per morning' price.
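As a small illustration of this integer encoding, a helper like the following (a sketch for readability, not part of the scraping pipeline) converts the integer hours into readable times:

```python
def format_hour(hour: int) -> str:
    """Convert a 24-hour integer (0-23) into a readable 12-hour string."""
    if hour == 0:
        return '12.00 am'   # midnight
    if hour < 12:
        return f'{hour}.00 am'
    if hour == 12:
        return '12.00 pm'   # noon
    return f'{hour - 12}.00 pm'

# e.g. the hire period from the example above
print(format_hour(6), '-', format_hour(12))  # 6.00 am - 12.00 pm
print(format_hour(14))                       # 2.00 pm
```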

In order to re-classify the 'Per session' prices, we need to see if there are characteristic time periods for each time_period_desc e.g. does 'Per day' have a characteristic time period that we can use to clearly classify a price as 'Per day'? To answer this question, we will visualise the time periods by plotting time_from against time_to as separate scatter plots for each time_period_desc:

In [184]:
px.scatter(prices_cleaned, x = 'time_from', y = 'time_to', 
           facet_col = 'time_period_desc', facet_col_wrap = 3, 
           height = 600, opacity = 0.01, title = 'time_to vs time_from for each time_period_desc')

Observations

  • We can ignore 'per hour' because that is only relevant for 'hire fee per hour' which we don't care about in this project.
  • We can clearly see that 'Per day', 'Per evening', 'Per morning' and 'Per afternoon' all have characteristic and very different time periods - each one has time periods clustering in different locations of the time_from vs time_to plot. The dark blue areas denote the characteristic time period areas for each time_period_desc.
  • 'Per day' has some overlap with 'Per evening', 'Per morning' and 'Per afternoon'. The color is pale blue in the areas it overlaps, so these seem like outliers, perhaps mis-classifications or mistakes.
  • 'Per session' shows dark blue that overlaps with 'Per day', 'Per evening', 'Per morning' and 'Per afternoon'. This suggests we were correct in assuming 'per session' has been used interchangeably with the other time_period_descs. Its characteristic time period areas look like the result of combining the other time_period_desc plots into a single plot.

Identify Characteristic Time Periods¶

In the below image, we show the time_from vs time_to plot for 'Per session' with the markers categorised into 'Per day', 'Per evening', 'Per morning' and 'Per afternoon'. We defined the categories by eye, i.e. by visually inspecting and identifying the characteristic regions and cutoffs for each time_period_desc. Note - we could have used a categorisation algorithm to classify the time_period_descs using the time_to and time_from data. Due to time limitations, we chose not to.

image

This clearly shows how we would re-classify the 'Per session' prices into the remaining time_period_descs.

Re-classify 'Per session'¶

Ideally we would take the decision boundaries / characteristic areas defined in the above plot and use them to categorise the 'per session' prices into the other time_period_descs. However, we did not know how to perform this 'area' categorisation quickly in Pandas. To save time, we decided to convert the 2d time period plots above into a 1d representation that we could easily define bins to re-classify into different time_period_descs.
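For reference, one way the 2d 'area' categorisation could be sketched in pandas is with boolean masks and np.select. The cutoffs below are purely illustrative assumptions for the sketch, not the boundaries identified in our plots:

```python
import numpy as np
import pandas as pd

# Hypothetical example prices labelled 'Per session'
sessions = pd.DataFrame({'time_from': [9, 9, 13, 18],
                         'time_to':   [17, 12, 17, 23]})

# Illustrative region cutoffs only -- real boundaries would be read off
# the time_from vs time_to scatter plots.
conditions = [
    (sessions.time_from <= 10) & (sessions.time_to >= 16),            # long daytime hire
    (sessions.time_from <= 10) & (sessions.time_to <= 13),            # morning hire
    sessions.time_from.between(11, 15) & (sessions.time_to <= 18),    # afternoon hire
    sessions.time_from >= 16,                                          # evening hire
]
labels = ['Per day', 'Per morning', 'Per afternoon', 'Per evening']

# First matching condition wins; anything unmatched falls back to 'NA'
sessions['reclassified'] = np.select(conditions, labels, default='NA')
print(sessions)
```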

The 1d time period representation we settled on used the following equation:

$-4 \cdot time\_from + time\_to$

This created a 1d number that took into account the differences between the time_from and time_to times and thus effectively grouped the different time_period_descs into different regions of this 1d representation. The distribution of different time_period_descs using this 1d representation are shown below:

In [185]:
prices_cleaned['_1d_time_period'] = (-4 * prices_cleaned.time_from
                                    + prices_cleaned.time_to)

px.histogram(prices_cleaned, x = '_1d_time_period', 
            color = 'time_period_desc', height = 400, 
            title = 'Distribution of 1d time period representation for different time_period_descs')

Note - for easier interpretation, we suggest you de-select 'Per hour' and 'Per session' by toggling them in the key to better see the distribution of 'Per day', 'Per evening', 'Per morning' and 'Per afternoon'.

We can see that, as in our 2d time_from vs time_to plots, the different time_period_descs have different characteristic values in this 1d representation. We can use these to classify the 'per session' prices into the other time_period_descs.

Based on this plot, we defined the bins for each time_period_desc using the 1d representation as follows:

  • -100 to -40 = Per Evening
  • -40 to -25 = Per Afternoon
  • -25 to -19 = Per Morning
  • -19 to 10 = Per Day
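These bins could equivalently be applied in a vectorised way with pd.cut. Below is a sketch using the bin edges above on some assumed example values; note that pd.cut's intervals are right-inclusive by default, so behaviour at the exact bin edges may differ slightly from a hand-written comparison:

```python
import pandas as pd

# Assumed example 1d time period values, one falling in each bin above
vals = pd.Series([-50, -30, -22, 0])

bins = [-100, -40, -25, -19, 10]
labels = ['Per evening', 'Per afternoon', 'Per morning', 'Per day']

reclassified = pd.cut(vals, bins=bins, labels=labels)
print(reclassified)
# -> Per evening, Per afternoon, Per morning, Per day
```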

We will create a new column 'time_period_desc_fixed' that will store the time_period_descs after re-classifying the 'per session' prices. We will keep the original time_period_desc column so we can use either in future, depending on how useful this exercise was.

In [186]:
prices_cleaned['time_period_desc_fixed'] = prices_cleaned.time_period_desc

Define function to re-classify 'per session' time_period_descs based on its 1d representative time period.

In [187]:
def fix(time_period_desc, _1d_time_period):
    """Classify 'Per session' time_period_desc based on its 1d_time_period"""
    
    # Ignore time_period_descs that are not 'Per session'
    if time_period_desc != 'Per session': 
        return time_period_desc
    # Choose new time_period_desc based on 1d_time_period value
    if _1d_time_period < -40: 
        return 'Per evening'
    if _1d_time_period < -25: 
        return 'Per afternoon'
    if _1d_time_period < -19: 
        return 'Per morning'
    
    return 'Per day'

Perform re-classification of 'per session' prices

In [188]:
prices_cleaned.time_period_desc_fixed = [fix(time_period_desc, _1d_time_period) 
                                         for time_period_desc, _1d_time_period 
                                         in zip(prices_cleaned.time_period_desc,
                                                prices_cleaned._1d_time_period)]

We now need to perform some checks. As a result of the re-classification, we may have classified a 'per session' price as a different price (e.g. 'Per day') when a space already has a 'Per day' price. If the space has 2 prices within the same price category, we can't tell which price should be used in the model. Firstly, we will check if this has occurred:

In [189]:
num_of_time_period_prices_per_group = (
    prices_cleaned.groupby(['space_url', 'day_of_week', 
                    'time_period_desc_fixed', 'price_type'], 
                   as_index = False
                  )['total_price'].count()
)

num_of_time_period_prices_per_group.rename(columns = {'total_price': 'number_unique_prices'},
                                           inplace = True)
num_of_time_period_prices_per_group['number_unique_prices'].value_counts()
Out[189]:
1    44699
2      541
Name: number_unique_prices, dtype: int64

We have found 541 group combinations (1,082 prices) where the re-classification has produced two prices with the same time_period_desc. Since this is a very small fraction, we will simply mark them as 'NA' in the 'time_period_desc_fixed' column and can thus easily ignore them in any analysis that uses the fixed time_period_descs.

In [190]:
# merge dfs based on combinations of 'space_url', 'day_of_week', 
# 'time_period_desc' and 'price_type'. 
prices_cleaned = pd.merge(prices_cleaned, num_of_time_period_prices_per_group, 
                          on = ['space_url', 'day_of_week', 
                                'time_period_desc_fixed', 'price_type'], 
                          how = 'left')

prices_cleaned.number_unique_prices.value_counts()
Out[190]:
1    44699
2     1082
Name: number_unique_prices, dtype: int64

Now we update time_period_desc_fixed to be 'NA' when the re-classification of 'per session' has created conflicting time_period_descs within the same space.

In [191]:
prices_cleaned.time_period_desc_fixed = np.where(prices_cleaned.number_unique_prices > 1, 
                                                 'NA', prices_cleaned.time_period_desc_fixed)
prices_cleaned.time_period_desc_fixed.value_counts()
Out[191]:
Per day          17689
Per evening      11102
Per hour          6206
Per afternoon     5586
Per morning       4116
NA                1082
Name: time_period_desc_fixed, dtype: int64

Now we will re-create our plot showing the count of observations / prices per price category (i.e. the number of observations available per separate price category):

In [192]:
fig = px.histogram(prices_cleaned, x="price_type", facet_row = 'day_of_week',
                   color = 'time_period_desc_fixed', barmode='group', height = 1300,
                  title = 'Count of Prices per Price Category')

fig.show()

Observations

  • For the price_type = 'min. spend' we have increased the typical 'Per evening' number of prices from ~600 to ~850 and have increased the 'Per day' from ~350 to ~500. This is a good increase.
  • For price_type = 'hire fee' we have increased typical 'Per day' prices from ~1500 to ~1550, a relatively small increase.
  • For price_type = 'hire fee', we also see about 130 prices per category classed as 'NA', meaning the space offered both a 'Per session' price and another time_period_desc price, and the re-classification mapped the 'Per session' price onto that same time_period_desc. This may suggest the re-categorisation of 'Per session' did not make sense for 'hire fee'.

We are still short of 1,000 observations per model for the 'min. spend' price categories.

We will now re-create the box plots showing price distributions for different price categories:

In [193]:
fig1 = px.box(prices_cleaned, x="price_type", y = 'total_price', 
              facet_row = 'day_of_week', color = 'time_period_desc_fixed',
              height = 1300, range_y = (0, 20000),
              title = 'Distribution of total_cost per Price Category')
fig1.show()

Observations:

  • The typical distributions for the remaining time_period_descs for both 'min. spend' and 'hire fee' look very similar to before re-classifying 'per session'.
  • The new plots for 'min. spend' make it very clear that the 'Per day' and 'Per evening' box plots are very similar in width and median across the week (as they were in the previous box plot). 'Per morning' and 'Per afternoon' also look very similar to one another. This suggests these prices follow similar patterns and so could be merged into the same model.
  • The 'hire fee' plots show a noticeable difference between 'Per day' and 'Per evening' and 'Per afternoon / morning' in terms of widths, whiskers and medians. As such, we think it is best to keep the 'Per day' prices in a separate model.

Weekday Analysis¶

Motivation¶

We noticed that many spaces have a single non-varying price across all days of the week e.g. a space may have a 'hire fee - per day' price that is £20 on all 7 days of the week (Monday to Sunday). We assume that some spaces will have varying prices across the week (e.g. perhaps there are higher prices on a Friday when there is more demand). It should be noted that the pricing on Tagvenue.com is probably not particularly reliable when looking at high precision. We assume that when only a single price is provided across the week that this is a minimum price or ball park price and that in reality there probably is variation across the week that the space has not included when uploading to Tagvenue.com.

In this section we want to understand the following:

  • How many spaces have prices that vary across the week?
  • Are there price trends across the week that we can model / derive insights from?
  • Will we attempt to model weekday variation in price?

In the following sections we will answer these questions via analysis of our pricing data. We have a choice of defining the pricing categories using either the original 'time_period_desc' column or the new 'time_period_desc_fixed' column. Although we plan on building model datasets using both columns, we decided to perform this analysis using only the 'time_period_desc_fixed' column because:

  • We are most interested in min. spend modeling, which will use the time_period_desc_fixed column
  • The min. spend price categories changed significantly when the time_period_desc_fixed column was used in the price category definition, whereas the hire fee categories changed much less. Thus, analysing 'hire fee' with time_period_desc_fixed should give results very similar to the original time_period_desc, while still capturing the changes to 'min. spend'.

How many prices vary across the week?¶

We will now quantify the fraction of prices per pricing category that vary across the week (i.e. how many prices within a price category show multiple different prices e.g. £30 on Monday and £50 on Tuesday etc.). We will do this using a groupby to count the number of unique prices per pricing category across a week. We will then visualise this data.

In [194]:
num_prices_per_week = prices_cleaned.groupby(['space_url', 'price_type', 
                                              'time_period_desc_fixed'], 
                                             as_index = False)['total_price'].nunique()

num_prices_per_week.rename(columns = {'total_price':'num_prices'}, inplace = True)

px.histogram(num_prices_per_week, x = "time_period_desc_fixed", color = 'num_prices', 
             facet_row = 'price_type', height = 1300, 
             title = 'Count of prices per category, showing proportion of prices that vary during the week' )

Observations:

  • For 'hire fee', varying prices account for a small fraction of the data. In the 'per day' time_period_desc, the vast majority (1413 out of ~1600) have only a single price across the week.
  • 'hire fee' does not appear to have many varying prices, so there will be few observations to model from.
  • For 'min. spend', we see that almost half of the 'Per evening' prices vary across the week, suggesting that perhaps we can look at including week day variation in the pricing.
  • The 'per day' time_period_desc for 'min. spend' shows a smaller proportion of prices that vary through the week (maybe a third or less) but also a higher proportion than for 'hire fee'.

Exploring Price Variation Per Weekday¶

A fairly significant chunk of the 'min. spend' prices vary throughout the week. We will now filter on these varying prices and try and find any weekday pricing trends.

We start by merging our previous groupby with our pricing data (so we can access the count of prices per week for every single price) and then can filter out the prices that have only 1 value across the week.

In [195]:
week_varying_prices = pd.merge(prices_cleaned, num_prices_per_week,
                          on = ['space_url', 
                                'time_period_desc_fixed','price_type'], 
                          how = 'left')

week_varying_prices.num_prices.value_counts(dropna = False)
Out[195]:
1    35827
2     6584
3     1673
4      841
5      635
6      158
7       63
Name: num_prices, dtype: int64
In [196]:
# Filter to prices with > 1 price per week
week_varying_prices = week_varying_prices[week_varying_prices.num_prices > 1]

Now we will count the number of varying prices in each price category. This lets us see exactly how many observations there are per category, to get a sense of the statistical power of our findings (if there are relatively few observations, any trends we find are not particularly reliable).

In [197]:
px.histogram(week_varying_prices, x="time_period_desc_fixed", color = 'day_of_week',
            facet_row = 'price_type', height = 1300, barmode='group', 
             title = 'Count of prices per price category (for prices that vary during week only)')

Observations:

  • 'Hire fee' price categories clearly have fewer varying prices than 'min. spend' overall, and relatively few observations per price category. The most it has is about ~200 observations, for the 'Per day' time period. This means any trends we see may not be representative of a wider sample.
  • 'min. spend' has a reasonable number of observations, especially for 'Per evening', which has around ~450. This suggests we can put a bit more faith in any trends we find being applicable more widely.
  • We can see an interesting trend in observations for 'min. spend - Per evening': the number of observations increases from Monday to mid-week (Thursday) and then decreases towards the weekend, with the least popular day being Sunday. This suggests 'min. spend' events in the evening are most available mid-week and least available over the weekend (Friday, Saturday, Sunday). This may be because spaces prefer walk-in / normal bar operation on weekends, when demand is high, and use venue hire to supplement income during the week.

Now that we understand our number of observations per pricing category, we will look for trends in weekday pricing. We will start by plotting boxplots showing the price distribution for 'min. spend' and 'hire fee' price categories across the week. This gives us a simple but powerful look at average trends in the distributions.

In [198]:
varying_prices_min_spend = week_varying_prices[week_varying_prices.price_type == 'min. spend']

px.box(varying_prices_min_spend, x="price_type", y = 'total_price', 
       facet_row = 'time_period_desc_fixed', color = 'day_of_week', height = 1300, 
       range_y = (0,20000), 
       title = 'Boxplots of price distribution for Min. Spend price categories')

Observations:

  • For both 'Per day' and 'Per evening' time periods, we see a gradual increase in the median / width of the box plots across the week, suggesting peak prices occur on Friday or Saturday, as you might expect.
  • The box plots suggest there is some kind of trend in prices over the week that we could possibly model or at least extract some simple averages from (e.g. a Friday price may typically be twice as much as a Monday price or something like that).
In [199]:
varying_prices_hire_fee = week_varying_prices[week_varying_prices.price_type == 'hire fee']

px.box(varying_prices_hire_fee, x="price_type", y = 'total_price', facet_row = 'time_period_desc_fixed',
       color = 'day_of_week', height = 1300, range_y = (0,12000),
      title = 'Boxplots of price distribution for Hire Fee price categories')

Observations:

  • In contrast to 'min. spend', we see a less significant trend in prices. The Monday to Thursday prices have remarkably similar box plots, suggesting that there is no typical price change for these days. We then see a small rise in medians / width on Friday to Sunday, suggesting a typical increase in these prices.

So it looks like we can see some typical weekday trends. We have only seen these in box plots, which show averages; they give us a sense of what might be going on. We would like to dive a bit deeper and look at typical weekday multipliers. For each space and each price category, we want to understand what days of the week each space increases the price and what this relative increase is (i.e. 1.5 times, 2 times etc.). In this way, we can see explicitly if most spaces tend to increase prices on the same days of the week.

Below, we perform this analysis by grouping every price according to the specific space, price type and time period. This group will have a single price per weekday e.g. a Monday price, a Tuesday price etc. We will then find the minimum weekday price and divide each weekday price by this price. E.g. if the prices Monday to Saturday are all £40 and the Sunday price is £80 then we will identify £40 as the minimum price and after dividing each price by 40 we will have a value of 1 for every weekday apart from Sunday which has a value of 2 i.e. Sunday is twice as expensive as the other days. We will call this metric the minimum price multiplier i.e. a representation of each price in terms of how much higher it is than the lowest weekday price.
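The worked example above can be sketched directly with a groupby transform, using toy data matching the £40 / £80 example:

```python
import pandas as pd

# Toy example matching the text: £40 Monday-Saturday, £80 on Sunday,
# all within one space / price category.
toy = pd.DataFrame({
    'space_url':   ['space_a'] * 7,
    'day_of_week': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
    'total_price': [40, 40, 40, 40, 40, 40, 80],
})

# Divide each weekday price by the minimum weekly price for the group
toy['min_price_multiplier'] = (
    toy.groupby('space_url')['total_price'].transform(lambda x: x / x.min())
)

print(toy)
# Mon-Sat get a multiplier of 1.0; Sunday gets 2.0 (twice the cheapest day)
```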

Below we perform this analysis and plot the distribution of minimum price multipliers per weekday for min. Spend and Hire Fee as boxplots:

In [200]:
week_varying_prices['min_price_multiplier'] = (
    week_varying_prices.groupby(['space_url', 
                                 'price_type', 
                                 'time_period_desc_fixed'], 
                                as_index = False)['total_price'].transform(lambda x: x / min(x))
)

varying_prices_min_spend = week_varying_prices[week_varying_prices.price_type == 'min. spend']

fig3 = px.box(varying_prices_min_spend, x="price_type", y = 'min_price_multiplier',
              color = 'day_of_week', height = 1300, facet_row = 'time_period_desc_fixed', 
             title = 'Distribution of Min. Spend minimum price multiplier per weekday')

fig3.update_layout(yaxis_range=[1,7])

fig3.show()

Observations:

  • We will focus on the 'Per evening' and 'Per day' time periods.
  • We can see that Monday and Tuesday have 0-width boxplots centered on a minimum price multiplier of 1. This means that over 75% of spaces do not increase their prices on these days (Monday and Tuesday represent the lowest weekly price for at least 75% of spaces). There are still some outliers visible, so some spaces do increase prices on these days, but the vast majority do not.
  • On Wednesday and Sunday, we have boxplots whose lower quartile and median both sit at 1, meaning that at least 50% of the spaces have a minimum price multiplier of 1 on these days, i.e. they do not increase prices. The upper quartile sits at about 1.6 - this means that, at most, 25% of the prices have been increased on these days, up to a maximum increase of 60% (1.6 times the lowest weekly price). The remaining 25% of the data must have a minimum price multiplier of 1.6 or higher, i.e. more than 60% above the lowest weekly price. If we assume the whiskers define where the outliers begin, the maximum typical increase is 2.5 - 2.9, meaning the price can get as high as 2 to almost 3 times the lowest weekly price (in non-exceptional cases).
  • Thursday, Friday and Saturday have full boxplots visible. More than 75% of spaces have a price increase on these days (up to the value of the lower quartile, typically about 1.5 or 50% increase). The medians sit at about 2 - 2.5 meaning 50% of prices are 2 - 2.5 times higher (at most) than the lowest weekday price. The upper quartile ranges sit at about 3.5 meaning 75% of prices increase by at most 3.5 times the lowest weekly price.
  • In terms of general trends, Monday and Tuesday typically have the lowest weekly prices. On Wednesday and Sunday, some spaces do increase prices, but the majority do not, and the increase is of the order of 60%. The highest prices are seen from Thursday to Saturday, with the typical increase growing each day. The majority of spaces increase prices on these days, typically to twice the lowest weekly price, though increases as high as 6 times are not exceptional.
  • We see that the trends identified in our original boxplot have been verified (and refined) by this plot.
In [201]:
varying_prices_hire_fee = week_varying_prices[week_varying_prices.price_type == 'hire fee']

fig4 = px.box(varying_prices_hire_fee, x="price_type", y = 'min_price_multiplier',
              color = 'day_of_week', height = 1300, facet_row = 'time_period_desc_fixed', 
             title = 'Distribution of Hire Fee minimum price multiplier per weekday')

fig4.update_layout(yaxis_range=[1,3])

fig4.show()

Observations:

  • This time, we will focus on the 'Per day' time period.
  • In general, we see lower price increases than for the 'min. spend' prices - with maximum boxplot whiskers of about 3, i.e. prices up to 3 times the minimum weekly price.
  • Monday and Tuesday have 0-width boxplots - the vast majority (over 75%) of spaces do not increase prices on these days.
  • Wednesday, Thursday and Friday have boxplots where the bottom of the box is also the median. This means that the majority (at least 50%) of prices do not increase on these days, while between 25% and 50% of prices do increase above the lowest weekly price. For Wednesday and Thursday this increase is very limited, typically around 10% and up to about 30% at most. On Friday the increase is a bit larger, typically going as high as 30% - 80%.
  • Saturday is the only full boxplot - this means that the vast majority (at least 75%) of prices increase on this day. The typical increase (the median) is about 1.25 i.e. 25% increase which goes up to a maximum of over 2 i.e. twice as large.
  • Sunday's boxplot has its bottom sitting on 1 - this means the first quartile (25%) of prices have a minimum price multiplier of 1, i.e. they don't increase prices on this day. The median sits at 1.2, the upper quartile at 1.3 and the whisker at 1.8. This suggests that between 50% and 75% of spaces increase prices by about 20%, up to a maximum typical value of 80%.
  • The typical trend is a gradual increase in prices over the week, with the highest prices being on Saturday.
  • Again, this confirms and refines our findings in our previous box plot of prices.

Conclusions¶

  • We found that the majority of prices on Tagvenue do not vary on different days of the week. We suspect that this is due to spaces wanting to showcase their cheapest rates or not wanting to spend the time putting in more accurate pricing information.
  • We identified and separated the prices that do vary across different weekdays. We found that 'min. spend' prices using the 'Per evening' time period had about 400 prices per weekday, which is a fairly good number to draw conclusions from. We found the 'hire fee' prices using 'Per day' had half as many (~200) observations per weekday, which is a bit limited and makes us more wary of any trends we identify.
  • We were able to find and visualise typical weekday trends for both 'hire fee' and 'min. spend' prices. We found that 'min. spend' prices tended to increase more over the week than 'hire fee' prices. Both price types showed a trend of increasing prices through the week, with maximum prices on weekends, though the specifics of the trends were quite different between the 2 price types.
  • 'Min. spend' prices tended to peak on Fridays and Saturdays, with typical peak increases of 2 times (i.e. twice) the lowest weekly price and maximum typical increases of 6 times the lowest weekly price.
  • 'Hire fee' prices tended to peak on Saturdays, with typical peak increases of 1.25 times the lowest weekly price, going as high as 2 times the lowest weekly price.

Create Datasets For Modeling¶

Dealing with Day_of_week¶

After exploring the price variations according to the day_of_week variable, we found some interesting trends. However, we have a relatively small sample of data for day of week variations (~400 for 'min. spend' and ~200 for 'hire fee'). As such, we have decided that initially we will not attempt to incorporate day_of_week variation in our model. We may use a rough average (e.g. double the price on a Saturday) instead. We may revisit day_of_week variation after building a simpler model first.
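The rough-average idea mentioned above could eventually look something like the sketch below. Note this is purely illustrative: the multiplier values are assumptions loosely inspired by the boxplot trends, not fitted to the data, and `adjust_for_day` is a hypothetical helper.

```python
# Illustrative only: rough weekday multipliers approximating the trends seen
# in the boxplots. These values are assumptions, not fitted to the data.
DAY_MULTIPLIERS = {
    'monday': 1.0, 'tuesday': 1.0, 'wednesday': 1.1, 'thursday': 1.1,
    'friday': 1.5, 'saturday': 2.0, 'sunday': 1.2,
}

def adjust_for_day(min_weekly_price, day_of_week):
    """Scale a predicted minimum weekly price by a rough day-of-week multiplier."""
    return min_weekly_price * DAY_MULTIPLIERS[day_of_week.lower()]

adjust_for_day(500.0, 'Saturday')  # -> 1000.0
```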

For the spaces which have multiple prices that vary on different days of the week, we will need to select a single price that will be our target variable. We have decided that this price will be the minimum weekday price. We chose this price because:

  • All prices on Tagvenue.com are described as minimum prices e.g. 'hire fee from £400'.
  • We suspect that if a space has supplied a single price for every day of the week, they would probably have provided their minimum price to make themselves look affordable and appealing to customers.

Below, we select the minimum weekday price as the price for every price category. We do this by grouping all prices by space_url, time_period_desc_fixed and price_type. Each group will have up to 7 prices, one for each day of the week. We then take the minimum price within each group.

In [202]:
min_weekly_prices = prices_cleaned.groupby(['space_url', 
                                            'time_period_desc_fixed', 
                                            'price_type'], 
                                           as_index = False)['total_price'].min()
In [203]:
min_weekly_prices.head()
Out[203]:
space_url time_period_desc_fixed price_type total_price
0 https://www.tagvenue.com/rooms/barnes/7039/the... Per morning per person 20.0
1 https://www.tagvenue.com/rooms/barnes/7040/the... Per morning per person 20.0
2 https://www.tagvenue.com/rooms/barnes/7041/the... Per morning per person 20.0
3 https://www.tagvenue.com/rooms/barnet/8642/go-... Per day per person 33.0
4 https://www.tagvenue.com/rooms/bayswater/3064/... Per day min. spend 2500.0

Below we visualise the count of prices per price category after aggregating the day_of_week prices into the minimum weekday price. Note that this will probably have increased the total number of prices per price category: if a space is not open on certain days of the week, it only contributed to the counts for the days it is open. Since different spaces are open on different days, every individual weekday count was lower than the total number of spaces. Now that we have aggregated over all days of the week, every space contributes exactly one price to its price category.
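This effect can be seen on a toy example (illustrative data, not from the scrape): two spaces open on different single days each contribute a per-weekday count of only 1, but after the groupby-min aggregation the price category counts one row per space.

```python
import pandas as pd

# Toy illustration: two spaces in the same price category, open on
# different single days of the week.
toy = pd.DataFrame({
    'space_url':   ['space_a', 'space_b'],
    'day_of_week': ['monday', 'saturday'],
    'time_period_desc_fixed': ['Per evening', 'Per evening'],
    'price_type':  ['min. spend', 'min. spend'],
    'total_price': [500.0, 2000.0],
})

# Before aggregating: each weekday count is 1 (below the 2 total spaces).
per_day_counts = toy.groupby('day_of_week').size()

# After aggregating: one row per space, so the category count is 2.
agg = toy.groupby(['space_url', 'time_period_desc_fixed', 'price_type'],
                  as_index=False)['total_price'].min()
len(agg)  # -> 2
```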

In [204]:
px.histogram(min_weekly_prices, x="price_type", color = 'time_period_desc_fixed', 
             barmode='group', height = 400, 
             title = 'Count of Prices per Price Category (After Aggregating Day Of Week into minimum weekly Price)')

Observations:

  • We see all price category counts have gone up as anticipated.
  • The 'min. spend - Per evening' price category has increased substantially! It is now over 1000! This suggests lots of 'Per evening' min. spend spaces were open on different days of the week - an interesting find.

Filtering on data we will model¶

We will now drop the 'per person', 'hire fee per hour' and combined price type categories, because we are not modeling them.

In [205]:
# Filter on price types = 'min. spend' and 'hire fee' only
min_weekly_prices = min_weekly_prices[min_weekly_prices.price_type.isin(['hire fee', 'min. spend'])]

px.histogram(min_weekly_prices, x="price_type", color = 'time_period_desc_fixed', 
             barmode='group', height = 400,
             title = 'Count of Prices per Price Category (After filtering on hire fee and min. spend)')

We will also drop the 'NA' time_period_desc_fixed entries.

In [206]:
min_weekly_prices = min_weekly_prices[min_weekly_prices.time_period_desc_fixed != 'NA']

px.histogram(min_weekly_prices, x="price_type",
                   color = 'time_period_desc_fixed', barmode='group', height = 400,
                  title = 'Count of Prices per Price Category (After removing NA time period)')

Define Datasets¶

Based on all the analysis and exploration so far, we are quite confident in the prices we have narrowed down to: the 'hire fee' and 'min. spend' price types, with the 'per session' time period manually corrected to the other time periods, and the day-of-week prices aggregated down to a minimum weekly price.

We are far less comfortable with how to deal with the time_period_desc_fixed definition of price categories. In particular, we are not sure whether the price categories separated by time_period_desc_fixed need separate models, e.g. whether 'min. spend - per morning' must be modeled separately from 'min. spend - per day'. The basic tradeoff is that if we model separately, we have fewer observations per model; if we model together, we risk combining observations which are not able to predict one another and hence diluting the effectiveness of our model.

We have decided that we need to make a decision on this, and so we will build the following datasets:

  • hire_fee_data - This dataset will be used to model the 'Hire Fee' prices. It will contain prices with price_type = hire fee and time_period_desc_fixed = Per day. We chose to only model the 'Per day' time period in isolation because it has over 1500 observations and the boxplots / distribution of 'hire fee' prices for different time periods look quite different and hence may not be suitable to be modeled together.
  • min_spend_data - This dataset will be used to model the 'min. spend' prices. It will contain prices with price_type = min. spend and time_period_desc_fixed = Per day or Per evening. We chose to combine the 'Per day' and 'Per evening' prices because their price boxplots look very similar, suggesting they can be modeled together and predict one another. We don't know the impact of keeping in the rows with multiple prices per space (i.e. if the same space has a 'Per evening' price and a different 'Per day' price). We also do not know if we can or should aggregate over these prices e.g. take the average. As such, we decided to just drop them.
  • all_data - This dataset will contain all hire fee and min. spend prices for all time_period_desc_fixed available (apart from NA). This dataset will allow us to further test other options such as leaving in rows which have multiple time periods and modeling some prices separately.

The plan is to build models for 'Hire fee' and 'min. spend' using their respective datasets as defined above. This way we get a rudimentary model for 'hire fee' and 'min. spend' quite quickly. We can then use the all_data dataset to remodel 'hire fee' and 'min. spend' using different combinations or separations of the prices based on the time_period_desc_fixed column.

Build Hire Fee and Min. Spend Datasets¶

Below we create the 'hire fee' and 'min. spend' price datasets as described above:

In [207]:
# Filter on time period = 'per day' and price type = 'hire fee' 
hire_fee_prices = min_weekly_prices[(min_weekly_prices.time_period_desc_fixed == 'Per day') 
                                  & (min_weekly_prices.price_type == 'hire fee')]
# Filter on time period = 'per day' or 'per evening' and price type = 'min. spend' 
min_spend_prices = min_weekly_prices[(min_weekly_prices.time_period_desc_fixed.isin(['Per day', 
                                                                                     'Per evening'])) 
                                   & (min_weekly_prices.price_type == 'min. spend')]

We need to drop any 'min. spend' spaces which have 2 conflicting prices (both a 'Per evening' and a 'Per day' price). Firstly, we will calculate how many spaces have this issue:

In [208]:
# group by space_url which is equivalent to group by space
min_spend_grouped_by_space = min_spend_prices.groupby('space_url',
                                                    as_index = False)
# Count number of unique prices per group
num_prices_per_space = min_spend_grouped_by_space['total_price'].nunique()
# Rename 'total_price' column to reflect its now a count of prices
num_prices_per_space.rename(columns= {'total_price':'num_prices'},
                            inplace = True)
num_prices_per_space['num_prices'].value_counts()
Out[208]:
1    1377
2     124
Name: num_prices, dtype: int64

124 spaces have this issue. We will drop these spaces, leaving 1377 spaces for modeling 'min. spend'.

In [209]:
# Filter on space_urls that have only 1 unique price
spaces_to_be_kept = (
    num_prices_per_space[num_prices_per_space.num_prices == 1].space_url
)
# Filter min_spend_prices on the above space_urls (only spaces with 1 price).
# Copy to avoid pandas' SettingWithCopyWarning on the in-place drop below.
min_spend_prices = min_spend_prices[min_spend_prices.space_url.isin(spaces_to_be_kept)].copy()
# Drop any rows with duplicate space_url (these are rows which had the 
# same price for 'Per day' and 'Per evening', thus they had 2 rows in the data). 
min_spend_prices.drop_duplicates(['space_url'], inplace=True)
min_spend_prices
Out[209]:
space_url time_period_desc_fixed price_type total_price
4 https://www.tagvenue.com/rooms/bayswater/3064/... Per day min. spend 2500.0
22 https://www.tagvenue.com/rooms/croydon/9082/mr... Per evening min. spend 500.0
25 https://www.tagvenue.com/rooms/croydon/9306/mr... Per evening min. spend 2000.0
51 https://www.tagvenue.com/rooms/london-wc1e-6jl... Per evening min. spend 2000.0
52 https://www.tagvenue.com/rooms/london-wc1e-6jl... Per evening min. spend 500.0
... ... ... ... ...
7694 https://www.tagvenue.com/rooms/sevenoaks/23641... Per day min. spend 5000.0
7696 https://www.tagvenue.com/rooms/shoreditch/5316... Per evening min. spend 750.0
7698 https://www.tagvenue.com/rooms/shoreditch/5395... Per day min. spend 2000.0
7701 https://www.tagvenue.com/rooms/south-woodford/... Per day min. spend 1000.0
7719 https://www.tagvenue.com/rooms/surrey/1884/san... Per day min. spend 5000.0

1377 rows × 4 columns

Merge Prices Data with Spaces Data¶

We will now merge the different prices datasets with the venues dataset - this combines our target variable ('total_price') with the space-specific data (e.g. area, location, facilities etc.). Note that when we merge the 'hire fee' and 'min. spend' prices with the space data, we use a '1:1' validation. This is because, when defining these 2 price datasets, we specifically ensured that each space contributes only one price (for 'hire fee' it is the 'Per day' price; for 'min. spend' it is either the 'Per evening' or 'Per day' price, and any space with 2 different prices was dropped). Thus, each space appears only once in the prices data and once in the venue data. For the all_data dataset, we kept all available time_period_desc_fixed prices, which means some spaces have multiple prices for different time_period_desc_fixed values and hence appear multiple times in the all_data dataset. Thus, the validation is 'm:1'.
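The `validate` argument turns these assumptions into hard checks: pandas raises a `MergeError` if the key uniqueness we claim does not hold. A small toy demonstration (illustrative frames, not our data):

```python
import pandas as pd

# A left frame with a duplicate key and a right frame with unique keys.
left = pd.DataFrame({'space_url': ['a', 'a'], 'total_price': [500.0, 750.0]})
right = pd.DataFrame({'space_url': ['a'], 'area': [100]})

# '1:1' requires unique keys on BOTH sides, so this raises a MergeError.
try:
    pd.merge(left, right, on='space_url', how='left', validate='1:1')
except pd.errors.MergeError:
    print('1:1 validation failed: duplicate keys on the left')

# 'm:1' permits duplicate left keys but still requires unique right keys.
merged = pd.merge(left, right, on='space_url', how='left', validate='m:1')
```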

In [164]:
hire_fee_data = pd.merge(hire_fee_prices, venues,
                         on = 'space_url',  
                         how = 'left', 
                         validate='1:1')

min_spend_data = pd.merge(min_spend_prices, venues,
                          on = 'space_url',  
                          how = 'left', 
                          validate='1:1')

all_data = pd.merge(min_weekly_prices, venues,
                    on = 'space_url',  
                    how = 'left', 
                    validate='m:1')

Save Data¶

We now save the 3 datasets separately.

Note - please update the save date to avoid over-writing data. Also note, we are using the date that the data was scraped rather than today's date.
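One way to reduce the risk of a stale placeholder (a suggestion, not part of the original workflow) would be to derive the scrape date from the input filename, since it already encodes it (e.g. `tag_venue_space_prices_25-Aug-22.csv`):

```python
import re

# Derive the scrape date from the cleaned-data filename rather than
# hard-coding it. The pattern matches dates like '25-Aug-22'.
source_file = '../data/cleaned_data/tag_venue_space_prices_25-Aug-22.csv'
match = re.search(r'(\d{2}-[A-Za-z]{3}-\d{2})\.csv$', source_file)
date = match.group(1)  # -> '25-Aug-22'
```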

In [118]:
date = 'add_date_here'

hire_fee_data.to_csv('../data/datasets_for_modeling/hire_fee_data_' + date + '.csv', index = False)
min_spend_data.to_csv('../data/datasets_for_modeling/min_spend_data_' + date + '.csv', index = False)
all_data.to_csv('../data/datasets_for_modeling/all_hire_fee_and_min_spend_data_' + date + '.csv', index = False)